Setup

library(ngsReports)
library(magrittr)
library(scales)
library(pander)
library(tidyverse)
if (interactive()) setwd(here::here("R"))
rawFqc <- list.files("../0_rawData/FastQC/", pattern = "zip", full.names = TRUE) %>%
    FastqcDataList()
trimFqc <- list.files("../1_trimmed/FastQC/", pattern = "zip", full.names = TRUE) %>%
    FastqcDataList()
deMuxFqc <- list.files("../2_demux/FastQC/", pattern = "zip", full.names = TRUE) %>%
    FastqcDataList()
origFqc <- list.files("../previousAnalysis/FastQC/", pattern = "fastqc.zip", full.names = TRUE) %>%
    FastqcDataList()
samples <- read_tsv("../0_rawData/samples.tsv")

Base Qualities

Each of the libraries was inspected for overall quality. Positions 3 & 4 from R2 libraries FCC21WPACXX-CHKPEI13070002_L6 and FCC21WPACXX-CHKPEI13070003_L7 showed clear problems with read qualities. This was likely due to an unsatisfactory nucleotide diversity in the original sequencing run and not enough phiX to overcome this, particularly as all R2 reads will begin with the restriction site (i.e. fragments will terminate with this as there is no R2 barcode).

Base qualities before trimming

Base qualities before trimming

Base qualities after trimming

Base qualities after trimming

Read Totals

Results from adapter removal. Reads < 70bp after trimming were discarded during this process
Library Raw Trimmed Retained Discarded
FCC229TACXX-CHKPEI13070001_L3 56,981,567 55,762,854 97.86% 2.14%
FCC2GPDACXX-CHKPEI13070007_L6 66,601,007 64,940,945 97.51% 2.49%
FCC2GPDACXX-CHKPEI13070004_L2 78,411,565 76,201,274 97.18% 2.82%
FCC2GPDACXX-CHKPEI13070006_L4 70,874,412 68,811,191 97.09% 2.91%
FCC21WPACXX-CHKPEI13070003_L7 65,481,942 63,542,676 97.04% 2.96%
FCC2GPDACXX-CHKPEI13070005_L3 78,445,394 76,021,945 96.91% 3.09%
FCC21WPACXX-CHKPEI13070002_L6 69,729,067 67,032,866 96.13% 3.87%
Total 486,524,954 472,313,751 97.1% 2.92%

The first check after demultiplexing is to ensure that read were not assigned to multiple individuals by sabre pe, given that one mismatch was permitted per barcode/restriction-site. Read Totals before and after demultiplexing were then checked and the recovery rate was >93% for each library, with the clear exception of FCC21WPACXX-CHKPEI13070002_L6. Manual inspection revealed that a significant majority (~9.3x106 of 13.4x106) of these non-recovered reads contained truncated barcodes with the first two bases missing. If an additional recovery step was taken this would increase the recovery rate to 93% for this library, and may yield an additional 5-600,000 reads per sample. For the next most poorly recovered library, an additional step may recover an additional 3x106 reads. However, no further action was taken at this point.

trimReadTotals <- readTotals(trimFqc) %>%
    mutate(Library = str_remove_all(Filename, "_[12].fq.gz")) %>%
    distinct(Library, Total_Sequences)
deMuxReadTotals <- readTotals(deMuxFqc) %>%
    mutate(ID = str_remove_all(Filename, ".[12].fq.gz")) %>%
    distinct(ID, Total_Sequences) %>%
    left_join(samples, by = "ID") %>%
    group_by(Library) %>%
    summarise(DeMultiplexed = sum(Total_Sequences)) %>%
    filter(!is.na(Library))
Recovery rate from demultiplexing the 1996 and 2012 samples, after adapter removal
Library Total Sequences DeMultiplexed Recovery Rate
FCC21WPACXX-CHKPEI13070002_L6 67,032,866 53,568,147 79.9%
FCC2GPDACXX-CHKPEI13070005_L3 76,021,945 70,702,537 93.0%
FCC2GPDACXX-CHKPEI13070006_L4 68,811,191 66,677,690 96.9%
FCC229TACXX-CHKPEI13070001_L3 55,762,854 55,037,971 98.7%
FCC21WPACXX-CHKPEI13070003_L7 63,542,676 62,787,380 98.8%
FCC2GPDACXX-CHKPEI13070004_L2 76,201,274 75,316,445 98.8%
FCC2GPDACXX-CHKPEI13070007_L6 64,940,945 64,349,296 99.1%
Total 472,313,751 448,439,466 94.9%

Read Totals for each of the 1996 & 2012 samples. The black line indicates the mean library size, whilst the dashed green line indicates the mean library size based on the original library before demultiplexing. Bar colours indicate sample population as 1996 (blue) or 2012 (red).

In summary, of the 486,524,954 initial reads, a total of 448,439,466 were retained after trimming and demultiplexing. This represents a combined recovery rate of 92.2% from the initial libraries.

Comparison against previous analysis

The revised strategy showed considerable improvement in yield for the libraries FCC21WPACXX-CHKPEI13070002_L6 and FCC21WPACXX-CHKPEI13070003_L7 which both contained exculsively 1996 samples. As such the revised strategy was pursued further.

Comparison of library sizes after demultiplexing using sabre (blue), vs the original method (salmon).

Comparison of library sizes after demultiplexing using sabre (blue), vs the original method (salmon).

Changes in library size using the improved recovery strategy

Changes in library size using the improved recovery strategy

Comparison of library sizes for the three populations.

Comparison of library sizes for the three populations.

GC Content

GC content for all 1996 and 2012 samples.

Inspection by GC content showed that gc2709 and gc2700 appeared to have an exaggerated peak around 60%, whilst all other samples showed a more broad spread across the range. From the Turretfield samples (previous analysis), pt1125 showed an unexpected peak around 50% indicating that sample may contain reads from a different species. This sample should be excluded from all further analysis.

GC content for all Turretfield samples.

Read Lengths

Length Distribution for all files

Sequence Content

Sequence content showing the presence of the RS in both the Turretfield and R2 samples.

Conclusion